Raman spectroscopy and topological machine learning for cancer grading

In the last decade, Raman Spectroscopy is establishing itself as a highly promising technique for the classification of tumour tissues as it allows to obtain the biochemical maps of the tissues under investigation, making it possible to observe changes among different tissues in terms of biochemical constituents (proteins, lipid structures, DNA, vitamins, and so on). In this paper, we aim to show that techniques emerging from the cross-fertilization of persistent homology and machine learning can support the classification of Raman spectra extracted from cancerous tissues for tumour grading. In more detail, topological features of Raman spectra and machine learning classifiers are trained in combination as an automatic classification pipeline in order to select the best-performing pair. The case study is the grading of chondrosarcoma in four classes: cross and leave-one-patient-out validations have been used to assess the classification accuracy of the method. The binary classification achieves a validation accuracy of 81% and a test accuracy of 90%. Moreover, the test dataset has been collected at a different time and with different equipment. Such results are achieved by a support vector classifier trained with the Betti Curve representation of the topological features extracted from the Raman spectra, and are excellent compared with the existing literature. The added value of such results is that the model for the prediction of the chondrosarcoma grading could easily be implemented in clinical practice, possibly integrated into the acquisition system.

Raman spectroscopy (RS) is a noninvasive optical technique sensitive to the molecular composition of biological tissues so that RS can be used to optically probe the molecular changes associated with disease tissues, making it possible to classify malignant cancer degrees 1 . Raman spectrum is a plot of scattered intensity as a function of the energy difference between the incident and scattered photons and is obtained by pointing a monochromatic laser beam at the tissue under investigation. Hence, the loss or gain in the photon energies corresponds to the difference in the final and initial vibrational energy levels of the molecules belonging to the specific spots of the tissue investigated. The difference between final and initial vibrational energy levels denote shifts in wavenumbers, which are unique for individual molecules resulting in specific peaks that are spectrally narrow and potentially associated with the vibration of a specific chemical bond in the molecules 2 .
Since the grading of cancer tissues is one of the main challenges for pathologists, RS is establishing itself as one of the most promising new techniques for supporting pathologists in making diagnoses as accurate as possible, avoiding or limiting as much as possible false positives and false negatives, unfortunately still commonly experienced by pathologists today, and increasing the overall accuracy of diagnostic protocols [3][4][5][6][7] . Recently, RS has been applied to chondrogenic tumour classification with excellent results 8 . Chondrogenic tumours are the second worldwide largest group of bone tumours, whose histologic pattern suggests a deep relationship to hyaline cartilage. Chondrosarcomas are tumours whose malignant cells produce a cartilaginous matrix. When they occur in previously normal bones, they are generally classified as primary chondrosarcomas. At the same time, secondary chondrosarcomas result from the malignant transformation of a benign cartilaginous lesion. They are classified into three malignant degrees, the first degree (CS G1), the second one (CS G2) and the third one (CS G3). In addition to such three degrees, Enchondroma (EC) is a noncancerous version. Distinguishing between EC and CS G1 is a rather critical issue for pathologists, generating many false positive and false negative diagnoses 9,10 . In

Materials and methods
Data acquisition. The data acquisition was carried out with a Thermo Fisher Scientific DXR2xi Raman microscope. A total of 10 patients, who were being treated at the Institution, Azienda Ospedaliera Universitaria Pisana, Pisa, were enrolled in the study under the Ethical Committee agreement. Details can be found in the paper 8 . Formalin-fixed paraffin-embedded tumour tissue sections (e.g. in Fig. 1) were collected on glass slides and subsequently submitted to RS analysis after the dewaxing step (e.g. in Fig. 2). The protocol to remove paraffin and formalin has provided the immersion of the histopathological sections in a series of two baths of xylene for 10 min, respectively, and then washing the sections in PolyButylene Succinate (PBS) to remove residual formalin. Moreover, to give an idea of the variability of the datasets, Fig. 3 shows the Raman spectra coming from Dataset 1 (Fig. 3a) and from Dataset 2 (Fig. 3b).
The Raman spectroscopy measurements were configured based on the following experimental parameters: laser wavelength 532 nm; power laser of 5-10 mW; 400-3400 cm −1 full range grating; 10×, 50× and 100× objectives; 25 µm pinhole; 5 (FWHM) cm −1 spectral resolution. Integration time for recording a Raman spectrum was 1 s and 10 scans for any spectrum. As a first step, the tissue morphology overview was carried out to identify the regions of interest with the collection of a number of mosaic images at low (10×) and intermediate (50×) magnification. Thus, the acquisition of Raman spectra was carried out with a 100× objective. Optimization of signal-to-noise ratio and minimization of sample fluorescence were obtained through preliminary measurements in order to set the best experimental parameters. Multiple measurements were performed in different regions within the various samples, in order to assess intra-sample variability. In turn, no pre-treatment of the samples was necessary before Raman measurements. Minimal preprocessing, including background removal and baseline application, was performed using the tools of the DXR2xi GUI, and a 5th order polynomial correction was used to compensate for the tissue fluorescence. Peaks were identified with specific tool support by Omicron 9.0 software.
Raman hyperspectral chemical maps ranging from 50 × 50 µm 2 (step size 1 µm ) to approximately 200 × 200 µm 2 (step size 4 µm ), recording several hundreds of spectra per map were collected. Raman maps provide the fundamental advantage of being able to localize Raman spectra to specific locations, providing local information about chemical composition.
Step sizes were chosen to have a collection time for each map less than 7 h for all the maps.
Ten supplemental spectra have been acquired, making use of an Xplora Plus (Horiba) in a similar experimental setup and preprocessing procedure in order to test the classification method on never seen data samples. This way, the results of the final test reported at the end of the "Results" section show that the classification method proposed is neither subject-dependent nor vendor-specific (DXR Thermo Fisher data for model training, Xplora Horiba data for final model testing).

Data analysis.
The core idea of our study is to employ the many tools of Topological Data Analysis (TDA) and machine learning (ML) to perform classification in the dataset of Raman spectra described in the previ- www.nature.com/scientificreports/ ous section. The concept of using topological and geometrical ideas in medical data is not a novel one and has already demonstrated substantial potential through multiple research papers [12][13][14][15][16][17][18] . In our approach, we evaluated the effectiveness of topological features by combining them with established machine learning algorithms, including support vector classifiers, random forest classifiers, and Ridge regressions. The reason why we preferred general machine learning algorithms to deep learning is twofold. Firstly, NNs are highly task dependent, whereas the adopted processing pipeline 19 based on TDA and ML is intended to be very general. Secondly, for completeness, a CNN was trained on the persistence images obtained in "Combining TDA and ML: the classification pipeline" section. Since the results were not at all satisfactory, this experiment was excluded from this work.
Mathematical background. This section is mainly devoted to the description of TDA, a relatively new branch of applied mathematics that aims to bridge the gap between computational topology and discrete Morse theory in the study of high dimensional data. The interested reader can find more information about this topic here 20,21 . More precisely, we introduce Persistent Homology (PH) as one of the main concepts of TDA. Roughly speaking, PH studies the geometry of spaces by looking at the evolution of k-dimensional holes at different scales. It keeps track of the appearance and disappearance of such holes, which are the topological features, in the form of intervals (birth, death) . The persistence of a topological feature is the span of its detectability, and it is a measure of its importance. In particular, features with a longer lifespan are more likely to be key features in describing the shape of the data space, while features with a short lifespan can often be assimilated to noise. The collection of the intervals (birth, death) is called the Persistence Diagram (PD). Mathematically, a persistence diagram is a multiset, which is a set where elements can appear multiple times, i.e. each element has a multiplicity. Different metrics can be defined in the space of persistence diagrams. Notwithstanding the precise mathematical definition of these metrics (for which we refer the reader to 20,21 ), an essential property is that the process that associates www.nature.com/scientificreports/ a PD with data is stable with respect to these metrics. This means that a small perturbation in the data yields a small perturbation of the associated PD. This property is of fundamental importance in applications because it guarantees robustness against noise and repeatability. The main drawback of PDs is that the space of multisets lacks fundamental properties required in a machine learning context. For this reason, a number of representation methods have been devised in order to exploit the PDs' expressiveness in ML algorithms. For more information on such representation methods, we refer the reader to these papers [22][23][24][25] . The key idea of all these methods is to embed the space of persistence diagrams in a more broad Hilbert space in a stable way, i.e. to vectorize the PD. After this last step, we are able to exploit the topological features extracted by persistent homology directly in a machine learning algorithm. Figure 4 shows the classical paradigm for topological data analysis.
Combining TDA and ML: the classification pipeline. This section is devoted to the description of the topological pipeline employed in this study. We refer to Fig. 5 for a general scheme of our approach. The classification pipeline consists of an automatic grid search for the optimal choice of (i) PH-base representation of the input data; (ii) ML classifier for cancer staging. Such a pipeline has already been presented in this article 19 , where a variety of tests on benchmark datasets were carried out. The present work describes its first application to experimental data. See Fig. 6 for a graphical example of our pipeline. The first step of the pipeline is to compute the PDs from the Raman Spectra. In doing so, we chose the Vietoris-Rips filtration 27 . In this approach, each point of the spectra is treated as a point in the Euclidean space R 2 . We grow balls centered at each point of the signal and when i balls intercept, an i − 1 simplex is added to the simplicial complex with birth value r, the current radius. An alternative approach might have been to use lower star filtration 20,28 . Without going into details, since this filtration does not generate points in H 1 for 1D signals, Vietoris-Rips was preferred. Hence, starting from the Raman spectra ( Fig. 6a), restricted to the wavenumber range 400-1800 cm −1 , the persistence diagram of homology in dimension 0 and 1 is computed with a Vietoris Rips filtration using the python Ripser package 29 (Fig. 6b). The www.nature.com/scientificreports/ PD is therefore vectorized using four different vectorization methods with different combinations of parameters. More specifically, the PDs are vectorized using the following setup: It is important to highlight the fact that PDs produce points in different homological dimensions, and such information must be treated carefully. In more detail, following the rich TDA literature, we employed four different approaches to deal with information originating from different dimensions. In the first (resp. second) approach, only the points in dimension H i for i = 0 (resp. i = 1 ) are considered in the vectorization. In the third www.nature.com/scientificreports/ one, the actual homology dimension is neglected, and all the points are vectorized altogether regardless of the dimension in which they show up. Finally, in the fourth approach, H 0 and H 1 are vectorized separately, and then the corresponding vectors are concatenated. We will refer to these approaches as H 0 , H 1 , H 0 + H 1 (fused) and H 0 + H 1 (concat) respectively. Such vectors represent the input for different machine learning classifiers. The classifiers employed in our pipeline are: • Support Vector Classifier 30 (SVC) with RBF kernel and C ∈ {1, 2, 3, 5, 10, 20}; • Random Forest Classifier 31 (RFC) with #trees = 100; • Ridge Regression 32 (RR) with α = 1.
These are well-known and standard ML classifiers; in this work, we used the implementation of the Scikit-learn library 33 . The pipeline performs a grid search between the four approaches, the different vectorization and classifiers and returns the accuracy of each method for each of the ten runs of a leave-one-patient-out cross-validation 34 (LOPO). We stress that the design of our experimentation, including vectorizations, classifiers and LOPO, is motivated by two main reasons: (i) to achieve enough consistency with the previous work of the pipeline 19

Results
In this section, we are going to explore the results achieved by the pipeline described in "Combining TDA and ML: the classification pipeline" section. Due to the scarcity of data, we were able to perform a large number of experiments without any kind of computational restriction. For a more detailed description of the experimental data, please refer to "Data acquisition" section. In our first experiment ("Supervised results" section), we  www.nature.com/scientificreports/ performed supervised learning on Dataset 1. In more detail, in "Supervised results" section we trained different combinations of labels. Specifically, we experimented by training the classifier with 4 labels ("LOPO validation 4 labels" section), with 3 labels (EC vs. CS G1 vs. CS G2 and CS G3, "LOPO validation 3 labels (EC vs. CS G1 vs. CS G2 and G3)" section) and two binary classifiers (EC vs. CS, "LOPO validation 2 labels (EC vs. CS)" section; EC and CS G1 vs. CS G2 and CS G3, "LOPO validation 2 labels (EC, CS G1 vs. CS G2, G3)" section). As explained in "Introduction" section, the most clinically meaningful subdivision is the binary classification EC vs. CS. Nevertheless, other subdivisions that may be clinically useful were also investigated, as a proof-of-concept study of the applicability of our method to the more challenging task of supporting the pathologist in the tumour grading, i.e. the subdivision no cancer vs. mild cancer vs. severe cancer in "LOPO validation 3 labels (EC vs. CS G1 vs. CS G2 and G3)" section. Indeed, the results obtained with these subdivisions, although of limited validity due to the small number of patients, encourage to enlarge the experimentation in order to validate our method further. Moreover, in "Unsupervised clustering" section we performed unsupervised learning (clustering) on Dataset 1. In "Comparison with CLARA 11 " section, we compare with the state of the art from the paper 11 . Finally, in "Final test with new data" section, the best model of "LOPO validation 2 labels (EC vs. CS)" section (supervised, 2-label classification), trained on Dataset 1, was tested on Dataset 2, in order to assess its generalization capability. The pipeline takes the Raman spectra as input, computes the PDs by means of a Vietoris Rips filtration, vectorizes the PDs, and feeds such vectors to a machine learning classifier: basically, we start from a vector, and we end up with another vector of topological features. For this reason, in "Supervised learning without TDA" section, an ablation study has been carried out by feeding the Raman spectra directly to the machine learning classifiers, and comparing with the results achieved. We recall that the pipeline performs a grid search between a large number of methods {vectorization method, classifier}. Moreover, we treat separately the different homology dimensions discussed in "Combining TDA and ML: the classification pipeline" section. For this reason, in the following, for each experiment, we report both a table showing the best accuracy among all methods for each run of the LOPO validation and each homology dimension, as well as a table showing the best single method (as average accuracy) for each homology dimension.
Supervised results. In the first, somewhat naive experiment, we split all the spectra in training and test, not requiring to have all the spectra from the same patient in the training or in the test set (and not in both). Soon after, we opted for a leave-one-patient-out cross-validation (LOPO) approach to prevent overfitting. Moreover, we did not always carry out a 4-class classification, but also conducted 3 and 2-class studies in accordance with the existing literature.
Tenfold cross-validation with 4 labels. The first experiment uses all the 400 spectra from Dataset 1 and performs a tenfold cross-validation. We highlight that, in doing so, spectra coming from the same patient can occur both in the train dataset and in the test dataset. We report the classification accuracy of each run and each homology dimension in Table 1, while Table 2 reports the single best method for each homology dimension. Clearly, the www.nature.com/scientificreports/ accuracy results are extremely satisfying and fully justify a study of Raman spectra for chondrosarcoma tumour's degree of malignancy.
LOPO validation 4 labels. In this second experiment, we repeat the same experiment of "Tenfold cross-validation with 4 labels" section but using a leave-one-patient-out cross-validation approach. That is, in turn, all spectra from the same patient were used as test and all other spectra from other subjects as train. Table 3 reports the accuracy results over the course of the ten runs of the leave-one-patient-out and Table 4 the best single method in terms of accuracy. Clearly, there is a huge difference with the results of the previous experiment. We want to focus on a very important aspect of these results. Regardless of the homology dimension considered, runs 1, 2, 3 and 6 obtain high accuracy, runs 4 and 5 fluctuating accuracy while the accuracy is low in the remaining runs. Comparing these runs with the composition of Dataset 1, described in "Data acquisition" section, we note that in the train dataset of runs 7, 8, 9, 10 there were spectra from a single patient, while in runs 4 and 5 there were spectra from two patients, but one in significantly greater numbers than the other, and finally in the remaining runs there were always spectra from two patients in the train set. These facts, together with the excellent results obtained in "Tenfold cross-validation with 4 labels" section, lead us to think that our method is effective in correctly classifying spectra, but the variability of spectra from a single patient is low and at least the spectra of two patients in the train set are needed to learn the class features. Otherwise, the classifier only learns to recognize the patient.

LOPO validation 3 labels (EC vs. CS G1 vs. CS G2 and G3).
In our third experiment, we repeat a leave-onepatient-out approach but only with 3 labels, i.e., more specifically, class EC vs. CS G1 vs. CS G2 and CS G3. The results in Table 5 show a marked improvement over "LOPO validation 4 labels" section, but there is no stability in the method that achieves them; also, the accuracy of the best single method, shown in Table 6, is not as satisfactory.

LOPO validation 2 labels (EC vs. CS).
In this experiment, we again performed supervised learning with a leaveone-patient-out cross-validation approach with only two labels, specifically class EC versus CS, viz. the benign class versus the malignant ones. We highlight the fact that in this way the balance of classes is lost. In particular, the EC class represents only 25% of the dataset. We recall that in the literature this is the most meaningful subdividion. We report the accuracy results in Tables 7 and 8. The accuracy results are promising in both tables, being above 98% in the first one, and above 80% in the second one.

LOPO validation 2 labels (EC, CS G1 vs. CS G2, G3
). Finally, we repeat the experiment of "LOPO validation 2 labels (EC vs. CS)" section with different labels, that is classes EC, CS G1 vs. CS G2, G3, that is low/no malignancy versus high malignancy. The results are reported in Table 9 and Table 10, and again we obtain pretty good accuracies. Table 3. Accuracy of the pipeline with 4 labels and a leave-one-patient-out cross-validation approach.  www.nature.com/scientificreports/ Supervised learning without TDA. As specified in "Results" section, our pipeline transforms vectors (Raman spectra) into vectors (topological features) before feeding them into a machine learning classifier. Therefore, to justify a topological study of Raman spectra, we perform an ablation study: in this experiment, we directly feed the Raman spectra in the same classifiers as in the pipeline and compute the accuracy following the same validation scheme. For the sake of brevity, we only report the mean accuracy and best classifier for each label classifier. We report these results in Table 11. We can see from the results that the accuracies are slightly worse. Table 5. Accuracy of the pipeline with 3 labels and a leave-one-patient-out cross-validation approach.  www.nature.com/scientificreports/ Moreover, it is clear how insignificant these results are, e.g., by looking at the case with 4 labels, where we have a variance greater than the accuracy value itself. Hence, we conclude that TDA increases both accuracy and, notably, consistency.
Unsupervised clustering. For completeness, we also performed an unsupervised study on the vectors obtained from the pipeline. That is, instead of feeding the vectorizations of the PDs to a machine learning classifier, we used them as input to a clustering algorithm. Following the extensive machine learning literature, we used the following clustering algorithms, all from the Scikit-learn library: Affinity propagation 35 , Agglomerative clustering 36 43 and Gaussian mixtures 44 . We report the accuracy results for the experiment with 4 labels, 3 labels, 2 labels with class EC vs. CS and 2 labels with classes EC, CS G1 vs. CS G2, G3 in Tables 12, 13, 14 and 15 respectively. The most interesting aspect of these results is that decreasing in the number of labels does not produce an increase in accuracy, suggesting that indeed the features extracted by the TDA are sufficiently separated.
Comparison with CLARA 11 . "Introduction" section highlights that the results we should compare to those described in the papers 8,11 , which achieve remarkable performances in classifying the same 400 Raman spectra. Table 9. Accuracy of the pipeline with 2 labels (EC, CS G1 vs. CS G2, G3) and a leave-one-patient-out crossvalidation approach.   www.nature.com/scientificreports/ In details, CLARA is the best-performing method to achieve the 3-label classification. For a fair comparison, we applied the same preprocessing steps as in the paper 11 . Namely, we performed a data augmentation on Dataset 1 and chose a specific train-test split. The steps are described below.
Data augmentation. Following the paper 11 , Dataset 1 is augmented with the addition of some shifts in the spectra to simulate potential inaccuracies in the wavelength calibrations and with the addition of an additive noise sampled from a normal distribution. More specifically, each original spectrum is passed through: • additive Gaussian noise with mean 0 and standard deviation (std) max(spectrum)/1000; • additive Gaussian noise with mean 0 and std 1; • shift uniform noise in the interval [−5, 5]; • additive Gaussian noise with mean 0 and std max(spectrum)/1000 and shift uniform noise in the interval [−5, 5]; • additive Gaussian noise with mean 0 and std 1 and shift uniform noise in the interval [−5, 5].
In this way, the final Dataset 1 (augmented) is six times larger than the original Dataset 1.
Train-test split. When validating our method in the previous sections, we performed a cross-validation and patient-stratified scheme. For a fair comparison with the paper 11 , where a static split is performed, in this section, we carry out the same division. Basically, the train set is composed of the spectra coming from patients 1, 3, 5, 6, 8, 10 while the remaining spectra compose the test set (31, 24 and 99 spectra respectively for class EC, CS G1 and CS G2, G3).
Results. We point out that, having followed the same procedure of data augmentation and train-test partitioning as in the paper 11 , the results obtained in this section are comparable in all respects to those obtained by CLARA. The only difference lies in the random seed used for data augmentation, but we can assume this difference is negligible. Table 16 shows the results achieved in this setting. The accuracy has improved significantly, and these results are in line with, if not slightly better than, those obtained in the bibliography 11 . Final test with new data. Let us now describe the last type of experiments we performed. In order to assess the generalization capability of the best-performing classification models found through the experimentation described above, we tested the models on new data: Dataset 2. This dataset consists of 10 Raman spectra, one for each patient. Moreover, the acquisition of these spectra occurred at a different time than Dataset 1,    Figure 7 shows the confusion matrices of both these experiments. As a comparison, we repeated these two procedures in the 4 label classification. The classifier trained with a LOPO validation performed poorly, predicting the labels [0, 3, 3, 1, 1, 3, 0, 2, 1, 0], which results in an accuracy of 40%. We highlight that the misclassifications are in adjacent classes, except in one case. This is in agreement with what occurred in binary classification. Training on the entire Dataset 1 yields the labels [0, 2, 2, 1, 1, 2, 0, 0, 3, 0], which corresponds to an accuracy of 80%. The improvement is remarkable and, together with the limitations of the dataset, shows the potential of this method in large-scale applicability. Figure 8 reports the confusion matrices for both of these experiments. Finally, following the great improvement achieved with data augmentation in "Data augmentation" section, we tried to train the classifiers on Dataset 1 (augmented) and test on Dataset 2,    www.nature.com/scientificreports/ but the results were the same. Also, performing the same data augmentation described in "Data augmentation" section on Dataset 2 basically did not change the results, as can be seen in Table 17.

Discussion and conclusions
The present work is devoted to highlighting the potential of combining topological data analysis and machine learning in the very challenging scenario of biochemical understanding of cancer grading through Raman spectroscopy. In order to better appreciate the contribution of the combination of TDA and ML in the proposed pipeline, we performed two ablation studies: • Supervised learning using the Raman spectra as vectors, fed to the ML part of the pipeline; • Unsupervised clustering of the vectorizations of the topological PDs (Agglomerative clustering, BIRCH, DBSCAN, K-Means, Mini batch K-Means, Mean-shift, OPTICS, Spectral clustering, and Gaussian mixtures).
Regarding the first study, results show that using the TDA descriptors is much more convenient (for both accuracy and significance) than classifying by the Raman spectra as vectors, in all the classification tasks (4, 3 and 2 labels). On the other hand, even when replacing the ML part of the pipeline with a standard clustering, we noted that the best-performing clustering method is affinity propagation, which has accuracy performances which are only slightly better than those achieved by the complete pipeline. Also, results are very promising with respect to the state of the art, as the classification accuracy outperforms the best results reported in literature 11 , as shown in "Comparison with CLARA 11 " section. Due to the size of the dataset, our results should be considered preliminary but, thanks to the strict validation scheme used, significant. Indeed, the classification accuracy proves to be excellent also when the pipeline is applied to Dataset 2, which is made of never-seen data acquired with a hardware device different from the one used to acquire Dataset 1. Moreover, in every experiments even a slight widening of the training dataset results in a great improvement in the accuracy and specificity (e.g. training on the whole Dataset 1 instead of a LOPO validation), showing excellent potential, both to classify new data and to be vendor neutral (with respect to the RS acquisition). Regarding the applicability of our approach, the proposed pipeline provides a classification model that can be easily integrated into a workflow (as already done in the commercial workstation as for the preprocessing modules), enabling the reduction of time and cost of the grading of cancerous tissues. In conclusion, we are convinced that topological machine learning methods are able to support the classification of data from Raman spectroscopy; also, we plan to perform more conclusive experimentation in the near future, in order to improve not only the classification accuracy of the proposed pipeline, but more importantly to increase the stability of the results with respect to the choice of the best combination {vectorization method, classifier}.

Data availability
The request for datasets, both raw and processed data, generated during the present study can be agreed and made directly to the corresponding author. www.nature.com/scientificreports/